Improve CBO estimates for correlated columns #11324

Merged
sopel39 merged 5 commits into trinodb:master from raunaqmorarka:cbo-correlation
Mar 11, 2022

@raunaqmorarka
Member

Description

The overall goal of this PR is to work towards enabling optimizer.default-filter-factor-enabled by default.
Enabling default-filter-factor with the existing implementation significantly improves q18 and q21 on TPC-H.
However, it also causes regressions on certain benchmark queries (TPC-DS partitioned q64, TPC-DS unpartitioned q78).
These changes update the estimation logic for filters and joins to address the
underestimation of filter conjunctions and the overestimation of multi-clause joins observed
when default-filter-factor is enabled with the existing implementation.

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Query optimizer

How would you describe this change to a non-technical end user or system administrator?

Improves CBO estimates in the presence of hard-to-estimate terms.

Related issues, pull requests, and links

Picks the first (n-1) commits from #11066

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Improve CBO estimates in the presence of correlated columns.

@cla-bot cla-bot Bot added the cla-signed label Mar 4, 2022
@raunaqmorarka
Member Author

TPC benchmark results with SF1000 ORC
Unpartitioned
Correlation changes sf1000 orc unpartitioned.pdf
Partitioned
Correlation changes sf1000 orc partitioned.pdf

@sopel39
Member

sopel39 commented Mar 4, 2022

prefix already approved in #11066

@raunaqmorarka raunaqmorarka force-pushed the cbo-correlation branch 2 times, most recently from 613370d to eb33a4a Compare March 9, 2022 15:34
@findepi
Member

findepi commented Mar 9, 2022

Improve CBO estimates for correlated columns

Is it about correlation in a sense of query outer scope references, or ... ?

@raunaqmorarka
Member Author

> Improve CBO estimates for correlated columns
>
> Is it about correlation in a sense of query outer scope references, or ... ?

This is about correlation in the data between columns (e.g. nation and city).

@raunaqmorarka
Member Author

Results with latest PR
Correlated changes sf1000 orc partitioned.pdf
Ignore the tpch/q05 result in the partitioned run: the report is missing the number for that query due to a glitch in the benchmarking infra. There was no change to that query in the actual run.
Correlation changes sf1000 unpartitioned orc.pdf

Member

@sopel39 sopel39 left a comment

Small comments. Did the tests change?

Comment thread core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java Outdated
Comment thread core/trino-main/src/test/java/io/trino/cost/TestFilterStatsCalculator.java Outdated
Currently we assume that there is no correlation between
the terms of a filter conjunction. This can result in underestimation,
as there is often some correlation between columns in real data sets.
In particular, predicates inferred on the build-side relation through
a join with a partitioned table are often correlated with user-provided
predicates on the build side.
Estimation for filter conjunctions now applies an exponential decay
to the selectivity of each successive term to reduce the chance of
underestimation.
optimizer.filter-conjunction-independence-factor is added to allow
tuning the strength of the independence assumption.

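The decay described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class and method names are hypothetical, and it assumes that terms are ordered from most to least selective and that the decay is applied as a shrinking exponent on each successive term's selectivity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; names are hypothetical and not Trino APIs.
public final class ConjunctionDecaySketch
{
    private ConjunctionDecaySketch() {}

    // independenceFactor = 1 treats the terms as fully independent (plain product);
    // independenceFactor = 0 treats them as fully correlated (most selective term only).
    public static double estimateSelectivity(List<Double> termSelectivities, double independenceFactor)
    {
        if (termSelectivities.isEmpty()) {
            return 1.0; // no terms, so nothing is filtered out
        }
        List<Double> sorted = new ArrayList<>(termSelectivities);
        Collections.sort(sorted); // assumption: most selective (smallest) term first
        double combined = sorted.get(0);
        double exponent = 1.0;
        for (int i = 1; i < sorted.size(); i++) {
            exponent *= independenceFactor; // exponent decays: f, f^2, f^3, ...
            combined *= Math.pow(sorted.get(i), exponent);
        }
        return combined;
    }
}
```

For two terms with selectivities 0.1 and 0.5 and a factor of 0.5, the sketch gives 0.1 * 0.5^0.5 ≈ 0.0707, which lies between the fully independent estimate (0.05) and the fully correlated one (0.1).
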
Currently we assume that there is perfect correlation between
the clauses of a join and use the most selective clause for
driving output row count estimation. This can result in overestimation,
as it is not necessary that columns in join keys are perfectly correlated
in real data sets.
Estimation for multi-clause joins now applies an exponential decay
to the selectivity of each successive term to reduce the chance of
overestimation.
optimizer.join-multi-clause-independence-factor is added to allow
tuning the strength of the independence assumption.

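As a worked illustration of the join case, assume the clause selectivities are ordered from most to least selective and that the decay takes the form of a shrinking exponent controlled by a factor f between 0 and 1. Both the form and the value f = 0.25 are assumptions chosen for illustration, not taken from this PR.

```latex
s_{\text{join}} = s_1 \cdot s_2^{f} \cdot s_3^{f^2} \cdots s_n^{f^{\,n-1}},
\qquad s_1 \le s_2 \le \dots \le s_n,\quad 0 \le f \le 1

% f = 0: s_join = s_1              (perfect correlation, the previous behaviour)
% f = 1: s_join = s_1 s_2 ... s_n  (full independence)

% Example with s_1 = 10^{-3}, s_2 = 10^{-2}, f = 0.25:
s_{\text{join}} = 10^{-3} \cdot \left(10^{-2}\right)^{0.25} = 10^{-3.5} \approx 3.2 \times 10^{-4}
```

In this example the estimate falls between the perfectly correlated value (10^{-3}) and the fully independent value (10^{-5}), so the second clause lowers the estimate without being trusted as fully independent.
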
List<Symbol> expressionSymbols = expressionUniqueSymbols.get(term);
int expressionPartitionId;
if (expressionSymbols.isEmpty()) {
    expressionPartitionId = symbolPartitions.size(); // For expressions with no symbols
Member

nit: -1 instead?

@sopel39 sopel39 merged commit 48d3fe3 into trinodb:master Mar 11, 2022
@sopel39 sopel39 mentioned this pull request Mar 11, 2022
@raunaqmorarka raunaqmorarka deleted the cbo-correlation branch March 11, 2022 14:00
@github-actions github-actions Bot added this to the 374 milestone Mar 11, 2022